Content-centric Age and Gender Profiling Notebook for PAN at CLEF 2013
Abstract
Author profiling can be considered a form of text analysis whose objective is to ascertain characteristics of the author behind a text sample. This paper describes the design and implementation of an approach for determining the age group (10s, 20s, or 30s) and gender (male/female) of text samples for the author profiling task in PAN 2013. Evaluation is based on the compounded accuracy in determining the correct age group and gender of the authors of samples in a test corpus. The training corpus provided for this task contains English and Spanish text samples from authors' online content (e.g. blogs, chats). The content of each sample is split into one or more “conversations”, all of which are wholly attributed to a specific author. To the best of our knowledge, interleaved responses from other persons (if any) have been filtered out, which focuses the scope of the analysis on the writing style and content present in an individual author’s sample. The underlying research in this work is thus the empirical investigation of features that can be extracted from the text samples and that help identify the gender and age group of an author based purely on characteristics present within his/her text samples. The main contribution of this work is a concise content-based feature built from similarity scores between a given text sample and the corpora of the different classes. This feature is compared and combined with some common style-based, vocabulary and idiosyncrasy features. Results from experiments on a balanced subset of the PAN 2013 author profiling training corpus show a clear contrast between the content-based feature and the other features, favouring the former for both the English and Spanish samples.
Ultimately, 24 five-fold cross-validation tests were run on the different feature sets on the balanced corpus, with the best accuracies for simultaneous gender and age group classification at 48.52% and 61.23% for the English and Spanish samples respectively, in contrast to a baseline of 16.67%.

1 Task and General Approach

Text analysis can involve processing user-generated content for various purposes, such as classification- or clustering-based tasks like author attribution [7][10][4] and plagiarism detection [12], and information retrieval related tasks such as information extraction and content summarization. This work focuses on the task of profiling the background and characteristics of groups of authors by analysing their text samples. In particular, this work is concerned with the author profiling task in PAN 2013 [1]: profiling the age group (10s, 20s, or 30s) and gender (male/female) of authors using a provided training corpus. As with author attribution, the key research questions in author profiling are the selection of the best features and the proper use of classification techniques to build an appropriate model that distinguishes between the different profile groups. Thus, in approaching this problem, we ask: “Why and how do authors in different sociolinguistic profile groups differ in their written communications, assuming a common language?” In general, there are two main factors contributing to differences in communication among authors in the different groups: (i) differences in content or subject matter, and (ii) syntactic and style-based differences [2]. Profiling text samples typically involves the sequential steps of describing the text sample (usually via features represented in a vector), investigating and selecting useful features and, lastly, building a model that represents the characteristic style of each author or author group.
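As an illustration of the first step, describing a sample as a feature vector, the content-based similarity feature mentioned in the abstract might be sketched as follows. This is only a minimal sketch under stated assumptions: the text does not specify the similarity measure or weighting, so raw term counts and cosine similarity are assumed here, and the function names, class labels and toy corpora are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def term_counts(texts):
    """Aggregate raw term counts over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(value * b.get(term, 0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_feature(sample, class_corpora):
    """Return one similarity score per class, in sorted class-label order.

    Assumption: cosine similarity on raw counts stands in for whatever
    similarity score the paper actually uses.
    """
    sample_vec = term_counts([sample])
    return [
        cosine(sample_vec, term_counts(class_corpora[label]))
        for label in sorted(class_corpora)
    ]


# Invented toy corpora, for illustration only (not PAN 2013 data).
toy_corpora = {
    "10s_male": ["school homework exam", "exam tomorrow"],
    "30s_female": ["office meeting mortgage"],
}

print(similarity_feature("exam homework tonight", toy_corpora))
```

The result is a short vector with one entry per profile class, which keeps the content-based feature compact compared with a full bag-of-words representation.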
In this work, we apply Principal Component Analysis (PCA) to linearly transform the high-dimensional data into a lower-dimensional space for a simpler representation, and subsequently use a popular implementation of the Support Vector Machine (SVM) classifier [3] to learn the model for the author profiling task. The rest of this paper is organised as follows. Section 2 describes the motivation behind our approach to the PAN 2013 author profiling task, ending with a tabular listing of the features used. Section 3 presents the results of the experiments together with an analysis of the various features used. Section 4 summarizes the approach and the findings of our analysis.
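The evaluation setup mentioned in the abstract (joint age-and-gender accuracy, five-fold cross-validation on a balanced corpus, and a 16.67% baseline) can be illustrated with a small sketch. It assumes that "compounded accuracy" counts a prediction as correct only when both age group and gender match, and that the baseline is chance level over the six combined classes; the helper names and the fold-assignment scheme are hypothetical, not taken from the paper.

```python
import random
from collections import defaultdict

AGE_GROUPS = ("10s", "20s", "30s")
GENDERS = ("male", "female")


def chance_baseline():
    """Chance accuracy for balanced joint classes: 1 / (3 ages * 2 genders)."""
    return 1.0 / (len(AGE_GROUPS) * len(GENDERS))


def joint_accuracy(gold, predicted):
    """Fraction of samples whose (age, gender) pair is predicted exactly.

    Assumption: a sample counts as correct only if both parts match.
    """
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy balanced label set: 10 samples for each of the 6 joint classes.
labels = [(age, gender) for age in AGE_GROUPS for gender in GENDERS] * 10
folds = stratified_folds(labels, k=5)
print(f"chance baseline: {chance_baseline():.2%}")
print("fold sizes:", [len(fold) for fold in folds])
```

In each round, one fold would serve as the held-out test set while the PCA transform and SVM model are fitted on the remaining four, and `joint_accuracy` would be averaged over the five rounds.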
Similar resources
Using Simple Content Features for the Author Profiling Task Notebook for PAN at CLEF 2013
This paper describes the methods we have employed to solve the author profiling task at PAN-2013. Our goal was to use simple features to identify the age group and the gender of the author of a given text. We introduce the features, detail how the classifiers were trained, and how the experiments were run.
Author Profiling using LDA and Maximum Entropy Notebook for PAN at CLEF 2013
This paper describes the traditional authorship attribution subtask of the PAN/CLEF 2013 workshop. In our attempt to classify the documents based on the gender and age of an author, we have applied a traditional approach of topic modeling using Latent Dirichlet Allocation (LDA). We used content-based features like topics and style-based features like preposition frequencies, which act as the eff...
Author Profiling: Predicting Age and Gender from Blogs Notebook for PAN at CLEF 2013
Author profiling is the task of determining the age, gender, native language or personality type of an author by studying their sociolect aspect, that is, how language is shared by people. In this paper, we propose a Machine Learning approach to determine an unknown author’s age and gender. The approach uses three types of features: content-based, style-based and topic-based. We were able to achieve an a...
Style-based Distance Features for Author Verification Notebook for PAN at CLEF 2013
In this paper we present the approach we took in our participation in the PAN 2013 Author Profiling task. It is an adaptation of our system submitted for author identification, assuming that a profile category (authors belonging to the same gender and age group categories) can be analyzed in the same way as an author’s style.
Can We Hide in the Web? Large Scale Simultaneous Age and Gender Author Profiling in Social Media Notebook for PAN at CLEF 2013
Would you target your audience differently, knowing the real age and gender of the text authors on your website forum? This paper examines hundreds of thousands of online documents, e.g. chat lines or blog posts, showing that computers are capable of addressing this task better than humans, without relying on content stereotypes. Pointing out that age and gender profiling are not independent probl...
Semantic-based Features for Author Profiling Identification: First insights Notebook for PAN at CLEF 2013
In this article we present a semantic-based approach to the identification of particular author traits, such as age and gender, from social media texts. The model described here is intended to provide information on different levels of analysis: from textual markers to semantics. Different classifiers were used to assess the performance and scope of the model.
Publication date: 2013